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BILEVEL PARAMETER LEARNING FOR HIGHER-ORDER 
TOTAL VARIATION REGULARISATION MODELS* 

J.C. DE LOS REYES\ C.-B. SCHONLIEB^ AND T. VALKONEN^ 


Abstract. We consider a bilevel optimisation approach for parameter learning 
in higher-order total variation image reconstruction models. Apart from the least 
squares cost functional, naturally used in bilevel learning, we propose and analyse 
an alternative cost, based on a Huber regularised TV-seminorm. Differentiability 
properties of the solution operator are verified and a first-order optimality system 
is derived. Based on the adjoint information, a quasi-Newton algorithm is pro¬ 
posed for the numerical solution of the bilevel problems. Numerical experiments 
are carried out to show the suitability of our approach and the improved perfor¬ 
mance of the new cost functional. Thanks to the bilevel optimisation framework, 
also a detailed comparison between TGV^ and ICTV is carried out, showing the 
advantages and shortcomings of both regularisers, depending on the structure of 
the processed images and their noise level. 


1. Introduction 

In this paper we propose a bilevel optimisation approach for parameter learning in 
higher-order total variation regularisation models for image restoration. The recon¬ 
struction of an image from imperfect measurements is essential for all research which 
relies on the analysis and interpretation of image content. Mathematical image re¬ 
construction approaches aim to maximise the information gain from acquired image 
data by intelligent modelling and mathematical analysis. 

A variational image reconstruction model can be formalised as follows. Given data 
/ which is related to an image (or to certain image information, e.g. a segmented or 
edge detected image) u through a generic forward operator (or function) K the task 
is to retrieve u from /. In most realistic situations this retrieval is complicated by 
the ill-posedness of K as well as random noise in /. A widely accepted method that 
approximates this ill-posed problem by a well-posed one and counteracts the noise is 
the method of Tikhonov regularisation. That is, an approximation to the true image 
is computed as a minimiser of 

(1.1) a R{u) + d{K{u),f), 

where i? is a regularising energy that models a-priori knowledge about the image u, 
is a suitable distance function that models the relation of the data / to the 
unknown u, and a > 0 is a parameter that balances our trust in the forward model 
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against the need of regularisation. The parameter a in particular, depends on the 
amount of ill-posedness in the operator K and the amount (amplitude) of the noise 
present in /. A key issue in imaging inverse problems is the correct choice of a, image 
priors (regularisation functionals R), fidelity terms d and (if applicable) the choice 
of what to measure (the linear or nonlinear operator K). Depending on this choice, 
different reconstruction results are obtained. 

While functional modelling (1.1) constitutes a mathematically rigorous and phys¬ 
ical way of setting up the reconstruction of an image - providing reconstruction 
guarantees in terms of error and stability estimates - it is limited with respect to its 
adaptivity for real data. On the other hand, data-based modelling of reconstruction 
approaches is set up to produce results which are optimal with respect to the given 
data. However, in general it neither offers insights into the structural properties of 
the model nor provides comprehensible reconstruction guarantees. Indeed, we believe 
that for the development of reliable, comprehensible and at the same time effective 
models (1.1) it is essential to aim for a unified approach that seeks tailor-made regu¬ 
larisation and data models by combining model- and data-based approaches. 

To do so we focus on a bilevel optimisation strategy for finding an optimal setup 
of variational regularisation models (1.1). That is, for a given training pair of noisy 
and original clean images (/, /o), respectively, we consider a learning problem of the 
form 

(1.2) min F(u*) = cost(u*, fo) subject to u* G arg min {a R(u) + d(K(u), f)} , 

U 

where T is a generic cost functional that measures the fitness of u* to the original im¬ 
age ft). The argument of the minimisation problem will depend on the specific setup 
(i.e. the degrees of freedom) in the constraint problem (1.1). In particular, we pro¬ 
pose a bilevel optimisation approach for learning optimal parameters in higher-order 
total variation regularisation models for image reconstruction in which the arguments 
of the optimisation constitute parameters in front of the first- and higher-order regu¬ 
larisation terms. Rather than working on the discrete problem, as is done in standard 
parameter learning and model optimisation methods, we optimise the regularisation 
models in infinite dimensional function space. We will explain this approach in more 
detail in the next section. Before, let us give an account to the state of the art of 
bilevel optimisation for model learning. In machine learning bilevel optimisation is 
well established. It is a semi-supervised learning method that optimally adapts itself 
to a given dataset of measurements and desirable solutions. In [34, 18, 14], for in¬ 
stance the authors consider bilevel optimization for finite dimensional Markov random 
field models. In inverse problems the optimal inversion and experimental acquisition 
setup is discussed in the context of optimal model design in works by Haber, Horesh 
and Tenorio [20, 21], as well as Ghattas et al. [8, 3]. Recently parameter learning 
in the context of functional variational regularisation models (1.1) also entered the 
image processing community with works by the authors [16, 9], Kunisch, Pock and 
co-workers [26, 13], Chung et al. [15] and Hintermiiller et al. [24]. 

Apart from the work of the authors [16, 9], all approaches so far are formulated 
and optimised in the discrete setting. Our subsequent modelling, analysis and opti¬ 
misation will be carried out in function space rather than on a discretisation of (1.1). 
While digitally acquired image data is of course discrete, the aim of high resolution 
image reconstruction and processing is always to compute an image that is close to 
the real (analogue, infinite dimensional) world. Hence, it makes sense to seek images 
which have certain properties in an infinite dimensional function space. That is, we 
aim for a processing method that accentuates and preserves qualitative properties in 
images independent of the resolution of the image itself [36]. Moreover, optimisation 
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methods conceived in function space potentially result in numerical iterative schemes 
which are resolution and mesh-independent upon discretisation [23]. 

Higher-order total variation regularisation has been introduced as an extension of 
the standard total variation regulariser in image processing. As the Total Variation 
(TV) [32] and many more contributions in the image processing community have 
proven, a non-smooth first-order regularisation procedure results in a nonlinear smooth¬ 
ing of the image, smoothing more in homogeneous areas of the image domain and 
preserving characteristic structures such as edges. In particular, the TV regulariser is 
tuned towards the preservation of edges and performs very well if the reconstructed 
image is piecewise constant. The drawback of such a regularisation procedure becomes 
apparent as soon as images or signals (in ID) are considered which do not only con¬ 
sist of constant regions and jumps, but also possess more complicated, higher-order 
structures, e.g. piecewise linear parts. The artefact introduced by TV regularisation 
in this case is called staircasing [31]. One possibility to counteract such artefacts is 
the introduction of higher-order derivatives in the image regularisation. Chambolle 
and Lions [10], for instance, propose a higher order method by means of an infimal 
convolution of the TV and the TV of the image gradient called Infimal-Convolution 
Total Variation (ICTV) model. Other approaches to combine first and second order 
regularisation originate, for instance, from Chan, Marquina, and Mulet [11] who con¬ 
sider total variation minimisation together with weighted versions of the Laplacian, 
the Euler-elastica functional [29, 12] which combines total variation regularization 
with curvature penalisation, and many more [27, 30] just to name a few. Recently 
Bredies et al. have proposed Total Generalized Variation (TGV) [4] as a higher-order 
variant of TV regularisation. 

In this work we mainly concentrate on two second-order total variation models: 
the recently proposed TGV [4] and the ICTV model of Chambolle and Lions [10]. 
We focus on second-order TV regularisation only since this is the one which seems to 
be most relevant in imaging applications [25, 5]. For D C open and bounded and 
u G i3V(D), the ICTV regulariser reads 

(1.3) ICTV„,/3(n) := “ Vu||_M(n;E2) -h /3 ||DVi;||^(q.]r2x2). 

On the other hand, second-order TGV [7, 6] for u G BV{Q) reads 

(1.4) TGVl^fiiu) := a\\Du - 'w||^(n;M 2 ) -h /3||^w^lU(n;Sym2(R2))- 

Here BD(D) := {rc G L^{Q]W^) \ ||Ft(;||_yv((Q.]Rnxn) < oo} is the space of vector fields 
of bounded deformation on Q, E denotes the symmetrised gradient and Sym^(M^) the 
space of symmetric tensors of order 2 with arguments in M^. The parameters a, (3 are 
fixed positive parameters and will constitute the arguments in the special learning 
problem a la (1.2) we consider in this paper. The main difference between (1.3) and 
(1.4) is that we do not generally have that w = 'Vv for any function v. That results in 
some qualitative differences of IGTV and TGV regularisation, compare for instance 
[1]. Substituting aR{u) in (1.1) by TGV^^(u) or ICTVo^^(u) gives the TGV image 
reconstruction model and the IGTV image reconstruction model, respectively. In this 
paper we only consider the case K = Id identity and d{u, f) = \\u — /||^ 2 (q) in (1-1) 
which corresponds to an image de-noising model for removing Gaussian noise. With 
our choice of regulariser the former scalar a in (1.1) has been replaced by a vector 
(a, (3) of two parameters in (1.4) and (1.3). The choice of the entries in this vector 
now do not only determine the overall strength of the regularisation (depending on 
the properties of K and the noise level) but those parameters also balance between 
the different orders of regularity of the function u, and their choice is indeed crucial 
for the image reconstruction result. Large f3 will give regularised solutions that are 
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close to TV regularised reconstructions, compare Figure 1. Large a will result in TV^ 
type solutions, that is solutions that are regularised with TV of the gradient [22, 30], 
compare Figure 2. With our approach described in the next section we propose a 
learning approach for choosing those parameters optimally, in particular optimally 
for particular types of images. 



(a) Too low d / High oscil¬ 
lation 


(b) Optimal /3 


Figure 1. Effect of /3 on TGV^ denoising with optimal a 



(a) Too low a, low /3. (b) Too low a, optimal /3. (c) Too high a, high /?. 

Good match to noisy data optimal TV^-like behaviour Bad TV^-like behaviour 


Figure 2. 


Effect of choosing a too large in TGV^ denoising 


For the existence analysis of an optimal solution as well as for the derivation of 
an optimality system for the corresponding learning problem (1.2) we will consider 
a smoothed version of the constraint problem (1.1) - which is the one in fact used 
in the numerics. That is, we replace R{u) - being TV, TGV or ICTV in this paper 
- by a Huber regularised version and add an regularisation with a small weight 
to (1.1). In this setting and under the special assumption of box constraints on a 
and /3 we provide a simple existence proof for an optimal solution. A more general 
existence result that holds also for the original non-smooth problem and does not 
require box constraints is derived in [17] and we refer the reader to this paper for a 
more sophisticated analysis on the structure of solutions. 

A main challenge in the setup of such a learning approach is to decide what is the 
best way to measure fitness (optimality) of the model. In our setting this amounts to 
choosing an appropriate distance F in (1.2) that measures the fitness of reconstructed 
images to the ‘perfect’, noise-free images in an appropriate training set. We have 
to formalise what we mean by an optimal reconstruction model. Classically, the 
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difference between the original, noise-free image /o and its regularised version is 
computed with an L 2 cost functional 

(1-^) = \Wa,l3 ~ fo\\L'^{n)^ 

which is closely related to the PSNR quality measure. Apart from this, we propose 
in this paper an alternative cost functional based on a Huberised total variation cost 


( 1 . 6 ) 



\D{Ua,p - /o)l7 dx, 


where the Huber regularisation | • |.y will be defined later on in Definition 2.1. We 
will see that the choice of this cost functional is indeed crucial for the qualitative 
properties of the reconstructed image. 

The proposed bilevel approach has an important indirect consequence: It estab¬ 
lishes a basis for the comparison of the different total variation regularisers employed 
in image denoising tasks. In the last part of the paper we exhaustively compare the 
performance of TV, TGV^ and ICTV for various image datasets. The parameters are 
chosen optimally, according to the proposed bilevel approach, and different quality 
measures (like PSNR and SSIM) are considered for the comparison. The obtained 
results are enlightening about when to use each one of the considered regularisers. In 
particular, ICTV appears to behave better for images with arbitrary structure and 
moderate noise levels, whereas TGV^ behaves better for images with large smooth 
areas. 


Outline of the paper In Section 2 we state the bilevel learning problem for the 
two higher-order total variation regularisation models, TGV and ICTV, and prove 
existence of an optimal parameter pair a, (3. The bilevel optimization problem is anal¬ 
ysed in Section 3, where existence of Lagrange multipliers is proved and an optimality 
system, as well as a gradient formula, are derived. Based on the optimality condition, 
a BFGS algorithm for the bilevel learning problem is devised in Section 5.1. For 
the numerical solution of each denoising problem an infeasible semi-smooth Newton 
method is considered. Finally, we discuss the performance of the parameter learning 
method by means of several examples for the denoising of natural photographs in 
Section 5. Therein, we also present a statistical analysis on how TV, ICTV and TGV 
regularisation compare in terms of returned image quality, carried out on 200 images 
from the Berkeley segmentation dataset BSDS300. 


2. Problem statement and existence analysis 

We strive to develop a parameter learning method for higher-order total variation 
regularisation models that maximises the fit of the reconstructed images to training 
images simulated for an application at hand. For a given noisy image / G L^(H), 
H C open and bounded, we consider 

(2T) mm|R„,;3(n)-h ^||u-/|||2(Q)| . 

where, a, /3 G M. We focus on TGV^ and ICTV image denoising: 

RaA^) = ■= 11“ “ 'f^)ll>l(n;R2) 

uu^iD Lj 


+ 11^ -^^llx(n;SymVRD)‘ 
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and (1.3) with spatial dependence 

Ra,piu) = ICTV„,/ 3 (m) := min \\a {Du - Vi;)||_yK(n;K 2 ) 

\/v&BVin) 

+ 11/3 3i)Vi;||_A4(f7.]R2x2), 

for u G BV{Q). For this model, we want to determine the optimal choice of a,/3, 
given a particular type of images and a fixed noise level. More precisely, we consider a 
training pair (/, /o), where / is a noisy image corrupted by normally distributed noise 
with a fixed variation, and the image /o represents the ground truth or an image that 
approximates the ground truth within a desirable tolerance. Then, we determine the 
optimal choice of a, /3 by solving the following problem 

( 2 . 2 ) min F{ua b) s.t. a,P>0, 

(a,^)eR 2 

where F equals the cost (1.5) or the Huberised TV cost (1.6) and Ua,i 3 for a given 
/ solves a regularised version of the minimization problem ( 2 . 1 ) that will be specified 
in the next section, compare problem (2.3b). This regularisation of the problem is 
a technical requirement for solving the bilevel problem that will be discussed in the 
sequel. In contrast to learning a, /3 in (2.1) in finite dimensional parameter spaces (as 
is the case in machine learning) we aim for novel optimisation techniques in infinite 
dimensional function spaces. 


2.1. Formal statement. Let 11 C MF be an open bounded domain with Lipschitz 
boundary. This will be our image domain. Usually U = (0, w) x (0, h) for w and h 
the width and height of a two-dimensional image, although no such assumptions are 
made in this work. Our data / and /o are assumed to lie in L‘^{Q). 

In our learning problem, we look for parameters (a, /?) that for some cost functional 
F : H^{Q) —)• M solve the problem 

(2.3a) min FiunB) 

(«,/3)eR2 


subject to 


(2.3b) G argmin J'’'’^(tt; a,/3) 

(2.3c) a,/3>0, 

where 

:= ^\\u - + Rl;^^{u). 

Here ,P'^{--,a, j3) is the regularised denoising functional that amends the regularisa¬ 
tion term in (2.1) by a Huber regularised version of it with parameter 7 > 0, and an 
elliptic regularisation term with parameter /r > 0. In the case of TGV^ the modified 
regularisation term R^^iu) then reads for u G H^{Q) 


TGYlyiu) 


mm I ( 

w£H^{n) Jq 


+ 


J^/3 \Ew\^ + f (ll' 


ii^IIhi(o) +11^ 
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and in the case of ICTV we have 
ICTV2^’g(R) := min f a\Du — Vv\^dx 

In ^ + 2 II^^IIhi(d)) ■ 

Here, BI^(r2) = H^{Q,]W^) and the Huber regularisation | • I7 is defined as follows. 

Definition 2.1. Given 7 G (0,oo], we define for the norm || • II2 on the Huber 
regularisation 

I I ^ flbl|2-2^, ||<?||2>l/7, 

[UaWl \\9\\2<i/r 

For the cost functional F, given noise-free data /o G L‘^{Q) and a regularised 
solution u G we consider in particular the cost 

:= -||/o — 

as well as the Huberised total variation cost 

:= [ l^(/o - ^^)l7 dx 
Jn 

with noise-free data /o G BV(n). 


2.2. Existence of an optimal solntion. The existence of an optimal solution for 
the learning problem (2.3) is a special case of the class of bilevel problems considered in 
[17], where existence of optimal parameters in (0, -|-oo]^'^ is proven. For convenience, 
we provide a simplified proof for the case where box constraints on the parameters 
are imposed. We start with an auxiliary lower semicontinuity result for the Huber 
regularised functionals. 

Lemma 2.1. Let u,v € L'P{LI), 1 <p < 00. Then, the funetional u J^\u — v\^ dx, 
where | • [7 is the Huber regularisation in Definition 2.1, is lower semicontinuous with 
respect to weak* convergence in 

Proof. Recall that for g G M™', the Huber-regularised norm may be written in dual 
form as 

I5I7 = sup|(g,5) - '^hWl : Iklb < l}- 

Therefore, we find that 

G{u) := / |u-u |7 dx = sup| / u{x) ■ ip{x) dx - [ ^\\ip{x)\\ldx : 

Jn ^ Jn Jn ^ 

if G C'“(0), ||(y9(a:)||2 < a for every x G n|. 

The functional G is of the form G{u) = sup{(n, p) — G*{ip)}, where G* is the convex 
conjugate of G. Now, let converge to u weakly* in Taking a 

supremising sequence for this functional at any point u, we easily see lower 

semicontinuity by considering the sequences {(u*, gP) — G*{(fJ)}’^.^ for each j. □ 

Our main existence result is the following. 
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Theorem 2.1. We consider the learning problem (2.3) for TGV^ and ICTV regu- 
larisation, optimising over parameters {a,j3) such that 0 < a < a,0 < /3 < p. Here 
(a,/S) < oo is an arbitrary hut fixed vector in that defines a box constraint on the 
parameter space. Then, there exists an optimal solution {a, jd) G for this problem 
for both choices of cost functionals, F = L 2 and F = • 

Proof. Let (an, Pn) C be a minimising sequence. Due to the box constraints 
we have that the sequence (an, (dn) is bounded in M^. Moreover, we get for the 
corresponding sequences of states Un '.= that 

J'^'^(Un,an,ldn) < (w, an, fdn), Vu G H^(0.), 


in particular this holds for u = 0. Hence, 

(2-4) 2\Wn — /Ili2(f2) + < 2^\fW l^ 

Exemplarily, we consider here the case for the TGV regulariser, that is y = 
TGV^’^’^. The proof for the IGTV regulariser can be done in a similar fashion. 
Inequality (2.4) in particular gives 

1111 HI (D) + II^’^IIhI(O) — ~\\f\\Fin)^ 

where Wn is the optimal w for This gives that (un,Wn) is uniformly bounded in 
H^(Pt) xEI^(H) and that there exists a subsequence {(an, /dn, Un, Wn)} which converges 
weakly in x H^(Q,) x]HI^(H) to a limit point (a, fd, u, w). Moreover, —?• u strongly 

in L^(D) and Wn —>• rc in L^(D; M”). Using the continuity of the fidelity term with 
respect to strong convergence in and the weak lower semicontinuity of the 
term with respect to weak convergence in and of the Huber regularised functional 
even with respect to weak* convergence in M (cf. Lemma 2.1) we get 


< 


< 


1,,-. 

- n — 

2" 

h 
2 


/lli^fn) + / a \Du — w\j dx + / P \Ew\.y dx 

^ Jq Jn 


+ ^ ( II^IIhi(d) + II^IIhi(d) 


liminf — \\un — 

n 2 


/IIl2(0) “ \R'Ur. 


— Wn\'y dx + 


I /3 I L/Uifi I 
Jn 


dx 


+ 




Ur, 


\m{n) 


+ \\Wr 


liminf + / an \DUn - Wn\'y dx+ [ /dn lEWnl^y dx 

"■2 ^ ' Jn Jn 


+ W ( ll^nllHl(O) + 11'“^*' 


where in the last step we have used the boundedness of the sequence a (un) from 
(2.4) and the convergence of (an, Pn) hr M^. This shows that the limit point u is an 
optimal solution for (a, (d). Moreover, due to the weak lower semicontinuity of the 
cost functional E and the fact that the set {(a, /3) : 0 < a < d, 0 < /3 < /3} is closed, 
we have that (a, jd,u) is optimal for (2.3). □ 


Remark 2.1. 

• Using the existence result in [17], in principle we could allow inhnite values for 
a and /3. This would include both TV^ and TV as possible optimal regularisers 
in our learning problem. 
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• In [17], in the case of the cost and assuming that 

RlAf) > 

we moreover show that the parameters (a, /?) are strictly larger than 0. In 
the case of the Huberised TV cost this can only be proven in a discretised 
setting. Please see [17] for details. 

• The existence of solutions with = 0, that is without elliptic regularisation, 
is also proven in [17]. Note that here, we focus on the ^ > 0 case since the 
elliptic regularity is required for proving the existence of Lagrange multipliers 
in the next section. 


3. Lagrange multipliers 


In this section we prove the existence of Lagrange multipliers for the learning prob¬ 
lem (2.3) and derive an optimality system that characterizes its solution. Moreover, 
a gradient formula for the reduced cost functional is obtained, which plays an im¬ 
portant role in the development of fast solution algorithms for the learning problems 
(see Section 5.1). 

In what follows all proofs are presented for the TGV^ regularisation case, that is 
0 = TGV^’^. However, possible modifications to cope with the ICTV model will 
also be commented. 

We start by investigating the differentiability of the solution operator. 


3.1. Differentiability of the solution operator. We recall that the TGV^ denois- 
ing problem is given by 


u = {v,w) = argmin 
BVxBD 



V — f\‘^ + f a\Dv 

Jn 


— w\^ 



Using an elliptic regularization we then get 


u = argmin 


a{u, u) -I- ^ “ /P + [ C(\Dv - w\^ -I- [ /3\Ew\j \ , 

^ Jn Jn Jn J 


where a{u,u) = /r (Uujj^i -|- A necessary and sufficient optimality condition 

for the latter is then given by the following variational equation 


(3.1) a(u, T)-!- / ah^{Dv — w){D(l) — (p) dx 

Jn 

+ / j3h-y{Ew)Eip dx + / {v — f)4>dx = 0, for all T G U, 

Jo. Jo. 

where T = and U = H^{Q) x EI^(H). 

Theorem 3.1. The solution operator S' : i— )• t/, which assigns to each pair (a, j3) G 

the corresponding solution to the denoising problem (3.1), is Frechet differentiable 
and its derivative is characterized by the unique solution z = S''(a, ;0)[0i, @ 2 ] ^ U of 
the following linearized equation: 


(3.2) a(z,T)-|- / 9i h^{Dv — w){D(f — ip) dx 


+ / ah'^{Dv — w){Dzi — Z 2 ){D(j) — ip) dx + / 62 h^{Ew)Eip dx 

Jo. J fl 

+ / fdh' {Ew)EZ 2 Eipdx + / zi4>dx = 0, for all T G U. 
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Proof. Thanks to the ellipticity of a(-, •) and the monotonicity of existence of a 
unique solution to the linearized equation follows from the Lax-Milgram theorem. 

Let (, := u'^ — u — z, where u = S{a, /?) and u'^ = S{a + 6 i, /3 + 6 * 2 ). Our aim is to 
prove that \\^\\u = o(|^|)- Combining the equations for u'^, u and 2 : we get that 


a(^,T)+ / {a + 61) h^{Dv~^ — w~^){D(j) — ip) dx — / a h.y{Dv — w){D(p — ip) dx 

J D. J Q, 

— / 01 h.y{Dv — w){D(/) — if) dx — / ah'.y{Dv — w){Dzi — Z2){D(f> — if) dx 

Ju J u 

+ / {fd + 92)h^{Ew^)Eip dx — / j 3 h^{Ew)Eip dx 

J Q J Q 

— 02 h^{Ew)Eip dx — (3 h'.y{Ew)Ez 2 Eip dx + 2 / f^ifdx = 0, for all T G ? 7 , 

Jsi Jo. Jo. 

where f := (^ 1 ,^ 2 ) G H^{Pt) x EI^(17). Adding and subtracting the terms 

I ah' {Dv — w){D 5 y — 5 w){D(j) — ip)dx 
Jn 


and 


/ fdh' {Ew)E5w '■ Eipdx, 

Jn 

where := Va+e — v and 5^, := Wa+e — w, we obtain that 


a(^,'I')+ [ ah'jDv - w){Df^i - i 2 ){D(f) - if) 

Jn 

+ f /3h'.y{Ew)Ef^2 ■ Eipdx + 2 / (ifdx 
Jn Jn 

= — a [h.y{Dv~^ — w^) — h.y{Dv — w) — h'{Dv — w){D 6 y — 6 ^)] {Df — ip) 

Jn 

— 01 [h.y{Dv~^ — w^) — h.y{Dv — w)] (Dcf) — ip) dx 

Jn 

— / 13 [h.y{Ew~^) — h.y{Ew) — h' {Ew)E 6 w] : Eipdx 

Jn 

— 02 [h.y{Ewa+e) — h.y{Ew)] : Eipdx, for all T G ?7. 

Jn 

Testing with T = ^ and using the monotonicity of /i^y(-) we get that 


IICIIh < C ||a| \\h^{Dv^ - w^) - h^{Dv - w) - h'.y{Dv - w){D5v - Sw )\\^2 

+ |0i| — tc"'') — hry{DV — 11^)11^2 

+ \(3\ \\h.yiEw^) - h.y{Ew) - h'^iEw)E6^\\^^ 

+\02\ \\h^{Ewa+e) - h.y{Ew)\\^ 2 } , 

for some generic constant C > 0. Considering the differentiability and Lipschitz 
continuity of it then follows that 

(3.3) WfWu < C (^\a\ o(||u+ - + l^i] Wu^+e - u\\^ 

+ l/5| + \ 02 \ \\Wa+e - W'IIhI(D)) > 
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where || • ||i^p stands for the norm in the space From regularity results for 

second order systems (see [19, Thm. 1, Rem. 14]), it follows that 

- '“lli^p < L\^\ (l|Div h^{Dv - ic)||-i,p + \\h^{Dv - iu)||_i,p + jjDiv h^{Ew)\\-i^p) 
< L\e\ {2\\h^{Dv - w)\\loo + \\h^{Ew)\\L^) 

<m, 

since \h^{-)\ < 1- Inserting the latter in estimate (3.3), we finally get that 

llelll/ = o(|0|). 


□ 

Remark 3.1. The Frechet differentiability proof makes use of the quasilinear struc¬ 
ture of the TGV^ variational form, making it difficult to extend to the ICTV model 
without further regularisation terms. For the latter, however, a Gateaux differentia¬ 
bility result may be obtained using the same proof technique as in [16]. 

3.2. The adjoint equation. Next, we use the Lagrangian formalism for deriving the 
adjoint equations for both the TGV^ and ICTV learning problems. Existence of a 
solution to the adjoint equation then follows from the well-posedness of the linearized 
equation. 

Defining the Lagrangian associated to TGV^ learning problem by: 

C{v, w, a, /3,pi,P2) = F{u) + p{w,p 2 )m^ 

+ / ah^{Dv - w){Dpi - P 2 ) + / /3h^{Ew)Ep2 + / {v - f )pi, 

J ^ J ^ J ^ 

and taking the derivative with respect to the state variable (v, w), we get the necessary 
optimality condition 


^[u,v)i'^^v,a,/3,pl,p2)[{6y,6^,)] = F'{u)6u + p{pi,Sy)Hi + p{p2,6^)ai 
-I- / ah'^{Dv - w){D6v - 5u,)(Dpi - P 2 ) 


-h / Ph' {Ew)E6wEp2 -h / piSv = 0. 


in 


If (5„, = 0, then 


/ ah'^{Dv - w){Dpi - p2)D5y + / pi6y = -VyF{u)6y, 
Jn J n 

whereas if = 0, then 


p{P 2 ,Sw)b^ - / ah'JDv - w){Dpi - P2)5 w 
Jn 

+ [ iJh'JEw) Ep2 E6w = -VwF{u)6w 
Jn 

Existence of a unique solution then follows from the transposition method, since 
the linearised equation is well-posed. 


Remark 3.2. For the ICTV model it is possible to proceed formally with the La¬ 
grangian approach. We recall that a necessary and sufficient optimality condition for 
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the ICTV functional is given by 

(3.4) ij{u,(j))Hi + + ah^{Du-'Vv){D(l)-Vif) 

Jn 

+ [ /3h^{DVv)DVip+ [ {u-f)(j) = 0, for all € H\n) xm\n) 

Jq Ju 

and the correspondent Lagrangian functional C is given by 

C{u,v,a,P,pi,p 2 ) = F{u) + p{u,pi)hi +/i(Vu, Vp2)hi 

+ f ah^{Du-V v) {Dpi-Vp 2 )+ f / 3 h^{DVv)DVp 2 + [iu-f)pi. 

J ^ J ^ J Vt 

Deriving the Lagrangian with respect to the state variable (n, v) and setting it equal 
to zero yields 

= F'{u)5u + p{pi,6u)h^ +^(Vp 2 , V(5^)hi 

+ [ ah' {Du-Vv){D 5 u-V 6 y){Dpi-Vp 2 ) 

Jn 

+ [ /3h'^{DVv)DV6^DVp2+ / = 0 . 

J Q Jn 

By taking succesively 6 v = 0 and 6 u = 0, the following system is obtained 

(3.5a) fi{pi, 6 u)m + / ah'^{Du - Vv){Dpi - Vp 2 )D 6 u + / pA = -F'{u) 6 u- 
Jn J n 

(3.5b) /i(Vp2, V5i,)ei + / ah' {Du-Vv){Dpi-Vp 2 )VSv 

Jn 

+ [ phUDVv)DVp2DV6y = 0 . 
Jn 

3.3. Optimality condition. Using the differentiability of the solution operator and 
the well-posedness of the adjoint equation, we derive next an optimality system for 
the characterization of local minima of the bilevel learning problem. Besides the opti¬ 
mality condition itself, a gradient formula arises as byproduct, which is of importance 
in the design of solution algorithms for the learning problems. 

Theorem 3.2. Let {ci,j3) G he a local optimal solution for problem (2.3). Then 
there exist Lagrange multipliers n G [/ and Ai,A 2 G LP‘{LL) such that the following 
system holds: 

(3.6a) o(tt,'k)-|-a / h^{Dv — w){D(l> — ip) dx 
Jn 

+ /J [ h^{Ew)Eipdx + 2 f {v — f)4>dx = 0, for all 4;' G H^{Ll) x ]HI^(D), 

Jn Jn 

(3.6b) a(n, 4')-|-Q! j h'JDv — w){Dpi — p 2 ){D(f — ip) dx 
Jn 

+/3 f h'{Ew)Ep 2 Eipdx + 2 f pi(fdx = —Eu{u)['I/], for all'It £ H^{Ll) xM^{Ll), 


Ai = / h^{Dv - w){Dpi - P 2 ) 
Jn 


(3.6c) 
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(3.6d) 

(3.6e) 

(3.6f) 


A2 = / h^{Ew) Ep2 
Jn 

Ai > 0 , A 2 ^ 0 

Ai • d = A 2 • /3 = 0. 


Proof. Consider the reduced cost functional E{a,l3) = E{u{a, 13)). The bilevel opti¬ 
mization problem can then be formulated as 

min E(a,B), 

{a,l 3 )eC 

where —)> M and C corresponds to the positive orthant in From [38, 

Thm. 3.1], there exist multipliers Ai,A 2 such that 

Xi = V^E{aJ) 

X2 = V0E{aJ) 

Ai > 0 , A 2 ^ 0 

Ai • d = A 2 • /3 = 0 , 

By taking the derivative with respect to (a,/?) and denoting by u' the solution to 
the linearized equation (3.2) we get, together with the adjoint equation (3.6b), that 

E'{a, (3)[(p] = Fuiu)u'{a,P)[(j)] 

= —a{Il,u') — a I h' {Dv — w){Dpi — P 2 ){Dv' — w') 

Jn 

- f3 [ h' {Ew)Ep 2 Ew' - 2 [ pW 
Jn Jn 

=—a{u',11) — a / h' {Dv — w){Dv'— w'){Dpi — P 2 ) 

Jn 

-P j h^^{Ew)Ew' Ep2-2 

Jn Jn 

which, taking into account the linearized equation, yields 
(3.7) 

Altogether we proved the result. 


v'pi 


E'{a, f 3 )[(l)] = / h^{Dv - w){Dpi - P2) + (t>2 / hy{Ew)Ep2. 

Jn J n 


□ 


Remark 3.3. From the existence result (see Remark 2.1), we actually know that, 
under some assumptions, d and ^ are strictly greater than zero. This implies that 
the multipliers Ai = A 2 = 0 and the problem is of unconstrained nature. This plays 
an important role in the design of solution algorithms, since only a mild treatment of 
the constraints has to be taken into account, as will be showed in Section 6. 


4. Numerical algorithms 

In this section we propose a second order quasi-Newton method for the solution of 
the learning problem with scalar regularisation parameters. The algorithm is based 
on a BFGS update, preserving the positivity of the iterates through the line search 
strategy and updating the matrix cyclically depending on the satisfaction of the cur¬ 
vature condition. For the solution of the lower level problem, a semismooth Newton 
method with a properly modified Jacobi matrix is considered. Moreover, warm ini¬ 
tialisation strategies have to be taken into account in order to get convergence for 
the TGV^ problem. The developed algorithm is also extended to a simple linear 
polynomial case. 
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4.1. BFGS algorithm. Thanks to the gradient characterization obtained in The¬ 
orem 3.2, we next devise a BFGS algorithm to solve the bilevel learning problems. 
We employ a few technical tricks to ensure convergence of the classical method. In 
particular, for numerical stability we need to avoid the boundary of the constraint 
set on the parameters, so we pick 0 < 0 < 0, considered numerically almost zero or 
inhnity, respectively, and require the box constraints 

(4.1) e<a,p<e. 

We also limit the step length to get at most a fraction closer to the boundary. As we 
show in [17] the solution is in the interior for the regularisation and cost functionals 
we are interested in. Below this limit, we use Armijo line search. 

Moreover, the good behaviour of the BFGS method depends upon the BFGS matrix 
staying positive definite. This would be ensured by the Wolfe conditions, but because 
of our step length limitation, the curvature condition is not necessarily satisfied. (The 
Wolfe conditions are guaranteed to be satisfied for some step length a, if our domain 
is unbounded, but the range where the step satisfies the criterion, may be beyond 
our maximum step length, and is not necessarily satisfied closer to the current point.) 
Instead we skip the BFGS update if the curvature is negative. 

Overall our learning algorithm may be written as follows. 

Algorithm 4.1 (BFGS for denoising parameter learning). Pick Armijo line search 
constant c, and target residual p. Pick initial iterate Solve the denoising 

problem (2.3b) for (a,/3) = (q:°,/ 3°), yielding vP. Initialise = I. Set i := 0, and 
iterate the following steps: 

(1) Solve the adjoint equation (3.6b) for 11*, and calculate VJ^(a*,/3*) from (3.7). 

(2) If f > 2 do the following: 

(a) Set s := (a^/3*) - (a*-\/3*-i), and r := (3^) - 

(b) Perform the BFGS update 

• ^ Jb*-!, s'^r < 0, 

■ \b*-i - 

(3) Compute Sa ,/3 from 

= g\ 

(4) Initialise a := min{l, (Tniax/2}, where 

o'max := max{(T > 0 I {a\j3^) + cr6a,i3 satisfies (4.1)}. 

Repeat the following: 

(a) Let (ao-,/3(j) := (a*,/3*) -|- crSa^p, and solve the denoising problem (2.3b) 
for (a,/3) = {aa,l3a), yielding Ua- 

(b) If the residual \\{aa,f3a) — (a*, ;d*)||/||(aa-, /3cr)|| < p do the following: 

(i) If miiia T{aa, Pa) < B'(a*,/3*) over all a tried, choose a* the min¬ 
imiser, set (a*^^,/I*"''^) := {aa*,Pa*), := Ua*, and continue 

from Step 5 

(ii) Otherwise end the algorithm with solution (a*,P*) := (a*,/3*). 

(c) Otherwise, if Armijo condition J^(q;o-,/J o-) < J^(a*,/3*)-|-(TcVJ^(a*,/J*)’^5 q,/ 3 
holds, set (a*"''^,/J*+^) := {aa,Pa), := Ua, and continue from Step 5. 

(d) In all other cases, set a := a/2 and continue from Step 4a. 

(5) If the residual ||(a*"*"^, ;0*+^) — (a*,/3 *)||/||(q;*'''^,/3*’''^)|| < p, end the algorithm 
with (a*, P*) := (a*^^, /J*"*"^)- Otherwise continue from Step 1 with i := i + 1- 

Step (4) ensures that the iterates remain feasible, without making use of a pro¬ 
jection step. This is justified since it’s been analytically proved that the optimal 
parameters are greater than zero (see [17]). 
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4.2. An infeasible semi-smooth Newton method. In variational form, the TGV^ 
denoising problem can be written as 

[i f {Dv ■ D4> + v(j)) + f ah^{Dv — w)D(j) + f (v — f)4> = 0, 

Jq, Jq, Jq 


/ {Ew : Eip + wip) — 
Jq 


ah^{Dv — w)Dip 


+ / l3h^{Ew) Eip = 0, 

Jn 


or, in general abstract primal-dual form, as 


Vy? G 


(4.2a) 

(4.2b) 


N 

Lu + A* qj = f in n 

i=l 

max{l/7, |[AjM](x)| 2 }(?j(x) — aj[Ajrt](x) = 0 a.e. inn, j = l,...,N. 


where L G M™), M™)') is a second order linear elliptic operator, Aj, j = 

1,..., N, are linear operators acting on u and qj{x), j = 1, . .., N , correspond to the 
dual multipliers. 

Let us set 

mj{u) := max{l/ 7 , |[Aju](x)| 2 }. 

Let us also define the diagonal application D{u) : L^(n;M™) —)• L^(n;M™') by 

['D{u)q]{x) = u{x)q{x), {x G Cl) 


We may derive Vu[S)(mj(u))(7j] being defined by 
Vu[2D(mj (u))pj] = A*'D{qj)'y\{Aju) where ^{z) 


0 , 

\z(x)\2 ’ 


z{x )\2 < 1/7 
z{x )\2 > 1 / 7 - 


Then (4.2a), (4.2b) may be written as 

N 

Lu + A* qj = f in n 

i=l 

T){mj{u))qj — OijAjU = 0, a.e. in Cl, {j = 1,, N). 

Linearising, we obtain the system 
(SSN-1) 


( L Al ... A*j^ \ 


/ \ 

—aiAi + '^{^Aiu)'D{qi)Ai D(mj(tt)) 0 0 


dqi 

: 0 0 



\^—a]\fA]\[ + ^{Aisfu)^{qi\f)A]sf 0 0 iD(mAr(u))y 


YQn/ 


where 

/ -Lu - ZZi + / \ 
aiAiu — D(mi(u))qi 

R := 

\aNANU - 'D{mN{u))qNj 

The semi-smooth Newton method solves (SSN-1) at a current iterate {u^,q\,.. .q}^). 
It then updates 


(SSN-2) 


(u*+\5"i+\ .. := {u^ + T6u,q\ + T6qi,qN + rSqN) 
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for a suitable step length r, allowing to become infeasible in the process. That 
is, it may hold that \q^^^{x )\2 > aj, which may lead to non-descent directions. In 
order to globalize the method, one projects 

(SSN-3) := Qfj), where q;)(x) := sgn((7(x)) min{Q;, |(7(x)|}, 

in the building of the Jacobi matrix. Following [23, 33], it can be shown that a 
discrete version of the method (SSN-l)-(SSN-3) converges globally and locally su- 
perlinearly near a point where the subdifferentials of the operator on (u, qi,... q^) 
corresponding (4.2) are non-singular. Further dampening as in [23] guarantees local 
super linear convergence at any point. We do not represent the proof, as going into 
the discretisation and dampening details would expand this work considerably. 


Remark 4.1. The system (SSN-1) can be simplified, which is crucial to obtain ac¬ 
ceptable performance with TGV^. Indeed observe that B is invertible, so we may 
solve 6u from 

N 

(4.3) B6u = Ri- 

i=i 

Thus we may simplify 6u out of (SSN-1), and only solve for 6qi,... ,6qN using a 
reduced system matrix. Finally we calculate 6u from (4.3). 


For the denoising sub-problem (2.3b) we use the method (SSN-l)-(SSN-3) with 
the reduced system matrix of Remark 4.1. Here, we denote by z in the case of TGV^ 
the parameters 

Z = {v,w), 


and in the case of IGTV 

z = (u, v). 

For the calculation of the step length r, we use Armijo line search with parameter 
c = 1 e“^. We end the SSN iterations when 


, Way'll 

^max{l, lly*]]} 


< lE 


where Jy* = {6z\6q\,... ,6q}^), and y* = {z\q\,... ,q)^). 


4.3. Warm initialisation. In our numerical experimentation we generally found Al¬ 
gorithm 4.1 to perform well for learning the regularisation parameter for TV denoising 
as was done in [16]. For learning the two (or even more) regularisation parameters for 
TGV^ denoising, we found that a warm initialisation is needed to obtain convergence. 
More specifically, we use TV as an aid for discovering both the initial iterate (a^, /3°) 
as well as the initial BFGS matrix B^. This is outlined in the following algorithm. 

Algorithm 4.2 (BFGS initialisation for TGV^ parameter learning). Pick a heuristic 
factor Jo > 0. Then do the following: 

(1) Solve the corresponding problem for TV using Algorithm 4.1. This yields 
optimal TV denoising parameter a^y, as well as the BFGS estimate Rtv for 
V2.F(a^y). 

(2) Run Algorithm 4.1 for TGV^ with initialisation (a^, (3^) := (a^yJo, c^tv)’ 
initial BFGS matrix B^ := diag(RTV<5o) .^tv)- 

With 11 = (0,1)^, we pick Jq = l/-^; where the original discrete image has i x £ 
pixels. This corresponds to the heuristic [35, 2] that if £ ~ 128 or 256 and the discrete 
image is mapped into the corresponding domain Q = (0, 1)“^ directly (corresponding 
to spatial step size of one in the discrete gradient operator), then j3 G (a, 1.5a) tends 
to be a good choice. We will later verify this through the use of our algorithms. 
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Now, if / G BV((0,£)^) is rescaled to BV((0,1)^), i.e. f{x) := f{x/£), then with 
u{x) := uixjt') and w{x) := we have 

(4-4) 2^\f ~ '“IIl2((o,£)2 ) + aWDu - 'R’||a4((o,£) 2;R2) + /3||St(;||;v(((o,£)2.]R2x2) 

= (^2 ~ “IIl 2 ((o,i) 2 ) + na\\Du - 'R’||x((o,i) 2 ;r 2 ) -|- n^/ 3 ||£'u;||_v({(o,i) 2 ;]R 2 x 2 ) 

This introduces the factor l/£ = between rescaled a, j3. 

5. Experiments 

In this section we present some numerical experiments to verify the theoretical 
properties of the bilevel learning problems and the efficiency of the proposed solution 
algorithms. In particular, we exhaustively compare the performance of the new pro¬ 
posed cost functional with respect to well-known quality measures, showing a better 
behaviour of the new cost for the chosen tested images. The performance of the 
proposed BEGS algorithm, combined with the semismooth Newton method for the 
lower level problem, is also examined. 

5.1. Gaussian denoising. We tested Algorithm 4.1 for TV and Algorithm 4.2 for 
TGV^ Gaussian denoising parameter learning on various images. Here we report the 
results for two images, the parrot image in Figure 4a, and the geometric image in 
Figure 5. We applied synthetic noise to the original images, such that the PSNR of 
the parrot image is 24.7, and the PSNR of the geometric image is 24.8. 

In order to learn the regularisation parameter a for TV, we picked initial a® = 
Q.ljf,. For TGV^ initialisation by TV was used as in Algorithm 4.1. We chose the 
other parameters of Algorithm 4.1 as c = 1 e“^, p = 1e“^, 9 = 1e— 8 , and 0 = 10. 
For the SSN denoising method the parameters 7 = 100 and p, = 1e“^^ were chosen. 

We have included results for both the L^-squared cost functional and the Hu- 
berised total variation cost functional T^V. The learning results are reported in 
Table 1 for the parrot images, and Table 2 for the geometric image. The denoising 
results with the discovered parameters can be found in the aforementioned Figure 4 
and Figure 5. We report the resulting optimal parameter values, the cost functional 
value, PSNR, SSIM [37], as well as the number of iterations taken by the outer BFGS 
method. 

Our first observation is that all approaches successfully learn a denoising parameter 
that gives a good-quality denoised image. Secondly, we observe that the gradient cost 
functional performs visually and in terms of SSIM significantly better for TGV^ 
parameter learning than the cost functional terms of PSNR the roles are 

reversed, as should be, since the is equivalent to PSNR. This again confirms that 
PSNR is a poor quality measure for images. For TV there is no significant difference 
between different cost functionals in terms of visual quality, although the PSNR and 
SSIM differ. 

We also observe that the optimal TGV^ parameters {a*, (3*) generally satisfy 
(3* ja* G (0.75,1.5)/£. This confirms the earlier observed heuristic that if ^ « 128, 256 
then /? G (1, 1.5)q; tends to be a good choice. As we can observe from Figure 4 and 
Figure 5, this optimal TGV^ parameter choice also avoids the stair-casing effect that 
can be observed with TV in the results. 

In Figure 3, we have plotted by the red star the discovered regularisation parameter 
(a*, /3*) reported in Figure 4. Studying the location of the red star, we may conclude 
that Algorithm 4.1 and Algorithm 4.2 manage to find a nearly optimal parameter in 
very few BFGS iterations. 
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a 


(a) Parrot, TGV^, cost functional 



(b) 


0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 


Parrot, TGV^, Ln cost functional 


Figure 3. Cost functional value versus (a, (3) for TGV^ denoising, 
for the parrot test images, for both and L^V cost functionals. The 
illustrations are contour plots of function value versus (a,/3). 


Table 1. Quantified results for the parrot image {£ = 256 = 
image width/height in pixels) 


Denoise 

Gost 

Initial (a,/3) 

Result {a*,/3*) 

Cost 

SSIM 

PSNR 

Its. 

Fig. 

TGV^ 

LJV 


(0.069/£^0.051/£) 

6.615 

0.897 

31.720 

12 

4(c) 

TGV^ 

-^2 


(0.058/^2,0.041/£) 

6.412 

0.890 

31.992 

11 

4(d) 

IGTV 

LiV 


(0.068/^2,0.051/4 

6.656 

0.895 

31.667 

16 

4(e) 

IGTV 

^2 


(0.051/£2,0.041/4 

6.439 

0.887 

31.954 

7 

4(f) 

TV 

LiV 

0.l/£ 

0.057/^ 

6.944 

0.887 

31.298 

10 

4(g) 

TV 

4 

0.l/£ 

0M2/£ 

6.623 

0.879 

31.710 

12 

4(h) 


Table 2. Quantified results for the synthetic image {£ = 256 = 
image width/height in pixels) 


Denoise 

Cost 

Initial a 

Result d* 

Value 

SSIM 

PSNR 

Its. 

Fig. 

TGV^ 

LJV 


(0.453/£^0.071/£) 

3.769 

0.989 

36.606 

17 

5(c) 

TGV^ 

-^2 


(0.307/£2,0.055/4 

3.603 

0.986 

36.997 

19 

5(d) 

ICTV 

LiV 


(0.505/£2,0.103/4 

4.971 

0.970 

34.201 

23 

5(e) 

ICTV 

4 

{a^y/£, a^y) 

(0.056/^2,0.049/4 

3.947 

0.965 

36.206 

7 

5(f) 

TV 

TiV 

0.l/£ 

0.136/£ 

5.521 

0.966 

33.291 

6 

5(g) 

TV 

4 

0.l/£ 

0.052/£ 

4.157 

0.948 

35.756 

7 

5(h) 


5.2. Statistical testing. To obtain a statistically significant outlook to the perfor¬ 
mance of different regularisers and cost functionals, we made use of the Berkeley 
segmentation dataset BSDS300 [28], displayed in Figure 6. We resized each image to 
128 pixels on its shortest edge, and take the 128 x 128 top-left square of the image. 
To this data set, we applied pixelwise Gaussian noise of variance = 2,10, and 
20. We tested the performance of both cost functionals, L^V and as well as the 
TGV^, ICTV, and TV regularisers on this dataset, for all noise levels. In the first 
instance, reported in Figures 7-10 (noise levels cr^ = 2,20 only), and Tables 3-5, we 
applied the proposed bi-level learning model on each image individually, to learn the 
optimal parameters specifically for that image, and a correponding noisy image for 
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(a) Original image 


(c) TGV^ denoising, cost (d) TGV^ denoising, L 2 cost 


(e) IGTV denoising, L^V cost (f) IGTV denoising, cost 


(g) TV denoising, L^V cost (h) TV denoising, L 2 cost 


Figure 4. Optimal denoising results for initial guess a 
TGV^ and a = 0.1/£ for TV 
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Figure 6. The 200 images of the Berkeley segmentation dataset 
BSDS300 [28], cropped to be rectangular, keeping top left corner, and 
resized to 128 x 128. 

all of the noise levels separately. For the algorithm, we use the same parametrisation 
as in Section 5.1. 

The figures display the noisy images, and indicate by colour coding the best result 
as judged by the structural similarity measure SSIM [37], PSNR, and the objective 
function value (T^V or cost). These criteria are, respectively, the top, middle, and 
bottom rows of colour-coding squares. Red square indicates that TV performed the 
best, green square indicates that ICTV performed the best, and blue square indicates 
that TGV^ performed the best—this is naturally for the optimal parameters for the 
corresponding regulariser and cost functional discovered by our algorithms. 

In the tables, we report the information in a more concise numerical fashion, indi¬ 
cating the mean, standard deviation, and median for all the different criteria (SSIM, 
PSNR, and cost functional value), as well as the number of images for which each reg¬ 
ulariser performed the best. We recall that SSIM is normalised to [0,1], with higher 
value better. Moreover, we perform a statistical 95% one-tailed paired t-test on each 
of the criteria, and a pair of regularisers, to see whether any pair of regularisers can 
be ordered. If so, this is indicated in the last row of each of the tables. 

Overall, studying the t-test and other data, the ordering of the regularisers appears 
to be 

ICTV > TGV^ > TV. 

This is rather surprising, as in many specific examples, TGV^ has been observed to 
perform better than ICTV, see our Figures 4 and 5, as well as [4, 1]. Only when the 
noise is high, appears TGV^ to come on par with ICTV with the cost functional 
in Figure 9 and Table 5. 

A more detailed study of the results in Figures 7-10 seems to indicate that TGV^ 
performs better than ICTV when the image contains large smooth areas, but ICTV 
generally performs better for more chaotic images. This observation agrees with the 
results in Figures 4 and 5, as well as [4, 1], where the images are of the former type. 

One possible reason for the better performance of ICTV could be that TGV^ has 
more degrees of freedom—in ICTV we essentially constrain w = 'Vv for some function 
V —and therefore overfits to the noisy data, until the noise level becomes so high that 
overfitting would become too high for any parameter. To see whether this is true, we 
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Figure 7. Ordering of regularisers with individual learning, cost, 
and noise variance = 2, on the 200 images of the BSDS300 dataset, 
resized. Best regulariser: red=TV, green=ICTV, blue=TGV^; 
top=SSIM, middle=PSNR, bottom=objective value. 



Figure 8. Ordering of regularisers with individual learning, cost, 
and noise variance = 2, on the 200 images of the BSDS300 dataset, 
resized. Best regulariser: red=TV, green=ICTV, blue=TGV^; 
top=SSIM, middle=PSNR, bottom=objective value. 


also performed batch learning, learning a single set of parameters for all images with 
the same noise level. That is, we studied the model 

^ 1 
min s.t. G argmin -||/i - 

“ 7 ^ neHi(n) ^ 

F'iiu) = - u\\l 2 ^^y or Fi{u) = j^\V{fo^i-u)\^dx, 


with 
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Figure 9. Ordering of regularisers with individual learning, cost, 
and noise variance = 20, on the 200 images of the BSDS300 dataset, 
resized. Best regulariser: red=TV, green=ICTV, blue=TGV^; 
top=SSIM, middle=PSNR, bottom=objective value. 



Figure 10. Ordering of regularisers with individual learning, cost, 
and noise variance = 20, on the 200 images of the BSDS300 dataset, 
resized. Best regulariser: red=TV, green=ICTV, blue=TGV^; 
top=SSIM, middle=PSNR, bottom=objective value. 


where a = (a, /3), /i,..., /at are the N = 200 noisy images with the same noise level, 
and /o,i,..., /o,Ar the original noise free images. 

The results are in Figures 11-14 (noise levels cr^ = 2,20 only), and Tables 6-8. 
The results are still roughly the same as with individual learning. Again, only with 
high noise in Table 8, does TGV^ not lose to ICTV. Another interesting observation 
is that TV starts to be frequently the best regulariser for individual images, although 
still statistically does worse than either ICTV or TGV^. 
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SSIM 

PSNR 

value 


mean 

std 

med 

best 

mean 

std 

med 

best 

mean 

std 

med 

best 

Noisy data 

0.978 

0.015 

0.981 

0 

41.56 

0.86 

41.95 

0 

2.9e'‘ 

3.1e^ 

2.9e^ 

0 

LJ,V-TV 

0.988 

0.005 

0.989 

1 

42.57 

1.10 

42.46 

5 

2.4e'‘ 

3.7e^ 

2.5e^ 

1 

L^V-ICTV 

0.989 

0.005 

0.990 

141 

42.74 

1.16 

42.62 

143 

2.3e‘‘ 

3.9e® 

2.4e^ 

137 

L^V-TGV^ 

0.989 

0.005 

0.989 

58 

42.70 

1.17 

42.55 

52 

2.4e‘‘ 

4.0e® 

2.5e^ 

62 

95% t-test 

IGTV > TGV^ > TV 

IGTV > TGV^ > TV 

IGTV > TGV^ > TV 

L'i-TV 

0.988 

0.005 

0.988 

2 

42.64 

1.14 

42.50 

2 

0.41 

0.08 

0.43 

2 

Li-ICTV 

0.988 

0.005 

0.989 

142 

42.79 

1.18 

42.64 

148 

0.39 

0.08 

0.41 

148 

Li-TGV^ 

0.988 

0.005 

0.989 

56 

42.76 

1.19 

42.58 

50 

0.40 

0.08 

0.42 

50 

95% t-test 

IGTV > TGV^ > TV 

IGTV > TGV^ > TV 

IGTV > TGV^ > TV 


Table 3. Regularise! performance with individual learning, and 
L^V costs and noise variance = 2; BSDS300 dataset, resized. 



SSIM 

PSNR 

value 


mean 

std 

med 

best 

mean 

std 

med 

best 

mean 

std 

med 

best 

Noisy data 

0.731 

0.120 

0.744 

0 

27.72 

0.88 

28.09 

0 

1.4e'’ 

2.5e^ 

1.4e^ 

0 

Liy-TY 

0.898 

0.036 

0.900 

4 

31.28 

1.63 

30.97 

8 

7.3e'‘ 

2.2e^ 

7.3e^ 

1 

L^V-IGTV 

0.906 

0.034 

0.909 

139 

31.54 

1.68 

31.21 

142 

7.1e'‘ 

2.2e^ 

7.1e^ 

121 

L^V-TGV^ 

0.905 

0.035 

0.907 

57 

31.47 

1.72 

31.10 

50 

7.1e'‘ 

2.2e'‘ 

7.1e^ 

78 

95% t-test 

IGTV > TGV^ > TV 

IGTV > TGV^ > TV 

IGTV > TGV^ > TV 

L^-TV 

0.897 

0.033 

0.898 

9 

31.54 

1.76 

31.15 

2 

5.52 

1.89 

5.51 

2 

Li-IGTV 

0.903 

0.032 

0.903 

131 

31.72 

1.76 

31.33 

148 

5.30 

1.81 

5.35 

148 

Li-TGV^ 

0.902 

0.033 

0.903 

60 

31.67 

1.80 

31.28 

50 

5.38 

1.87 

5.39 

50 

95% t-test 

IGTV > TGV^ > TV 

IGTV > TGV^ > TV 

IGTV > TGV^ > TV 


Table 4. Regularise! performance with individual learning, ^^^1 
L^V costs and noise variance = 10; BSDS300 dataset, resized. 



SSIM 

PSNR 

value 


mean 

std 

med 

best 

mean 

std 

med 

best 

mean 

std 

med 

best 

Noisy data 

0.505 

0.143 

0.516 

0 

21.80 

0.92 

22.14 

0 

2.8e'’ 

7.9e^ 

2.8e'" 

0 

Liy-TY 

0.795 

0.063 

0.799 

7 

27.27 

1.64 

27.02 

11 

I.Oe^ 

3.5e^ 

9.7e'‘ 

1 

L^V-IGTV 

0.810 

0.061 

0.814 

120 

27.52 

1.66 

27.24 

125 

9.7e'‘ 

3.4e'‘ 

O.dE-^ 

79 

L^V-TGV^ 

0.808 

0.062 

0.814 

73 

27.50 

1.74 

27.15 

64 

9.8e'‘ 

3.5e'‘ 

9.5e-‘ 

120 

95% t-test 

IGTV > TGV^ > TV 

IGTV, TGV^ > TV 

IGTV, TGV^ > TV 

L'i-TY 

0.802 

0.056 

0.804 

8 

27.70 

1.93 

27.28 

0 

13.65 

5.53 

13.14 

0 

Li-IGTV 

0.811 

0.056 

0.816 

126 

27.86 

1.91 

27.45 

138 

13.14 

5.22 

12.62 

138 

Li-TGV^ 

0.810 

0.057 

0.814 

66 

27.83 

1.94 

27.41 

62 

13.28 

5.38 

12.77 

62 

95% t-test 

IGTV > TGV^ > TV 

IGTV > TGV^ > TV 

IGTV > TGV^ > TV 


Table 5. Regularise! performance with individual learning, 
L^V costs and noise variance = 20; BSDS300 dataset, resized. 


For the first image of the data set, ICTV does in all of the Figures 7-14 better 
than TGV^, while for the second image, the situation is reversed. We have highlighted 
these two images for the cost in Figures 15-18, for both noise levels a = 2 and 
a = 20. In the case where ICTV does better, hardly any difference can be observed 
by the eye, while for second image TGV^ clearly has less stair-casing in the smooth 
areas of the image, especially with the noise level a = 20. 

Based on this study, it therefore seems that ICTV is the most reliable regularise! of 
the ones tested, when the type of image being processed is unknown, and low SSIM, 
PSNR or cost functional value is desired. But as can be observed for individual 
images, it can within large smooth areas exhibit artefacts that are avoided by the use 
of TGV^. 
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Figure 11. Ordering of regularisers with batch learning, L^V cost, 
and noise variance = 2, on the 200 images of the BSDS300 dataset, 
resized. Best regulariser: red=TV, green=ICTV, blue=TGV^; 
top=SSIM, middle=PSNR, bottom=objective value. 



Figure 12. Ordering of regularisers with batch learning, cost, and 
noise variance = 2, on the 200 images of the BSDS300 dataset, 
resized. Best regulariser: red=TV, green=ICTV, blue=TGV^; 
top=SSIM, middle=PSNR, bottom=objective value. 


5.3. The choice of cost functional. The L| cost functional naturally obtains bet¬ 
ter PSNR than R^V, as the two former are equivalent. Comparing the results for 
the two cost funtionals in Tables 3-5, we may however observe that for low noise 
levels = 2,10, and generally for batch learning, attains better (higher) SSIM. 
Since SSIM better captures [37] the visual quality of images than PSNR, this recom¬ 
mends the use of our novel total variation cost functional R^V. Of course, one might 
attempt to optimise the SSIM. This is however a non-convex functional, which will 
pose additional numerical challenges avoided by the convex total variation cost. 
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Figure 13. Ordering of regularisers with batch learning, L^V cost, 
and noise variance = 20, on the 200 images of the BSDS300 dataset, 
resized. Best regulariser: red=TV, green=ICTV, blue=TGV^; 
top=SSIM, middle=PSNR, bottom=objective value. 



Figure 14. Ordering of regularisers with batch learning, cost, and 
noise variance = 20, on the 200 images of the BSDS300 dataset, 
resized. Best regulariser: red=TV, green=ICTV, blue=TGV^; 
top=SSIM, middle=PSNR, bottom=objective value. 

Conclusion and Outlook 

In this paper we propose a bilevel optimisation method in function space for learn¬ 
ing the optimal choice of parameters in higher-order total variation regularisation. 
We present a rigorous analysis of this optimisation problem as well as a numerical 
discussion in the context of image denoising. In particular, we make use of the bilevel 
learning approach to compare the performance - in terms of returned image qual¬ 
ity - of TV, ICTV and TGV regularisation. A statistical analysis, carried out on a 
dataset of 200 images, suggest that ICTV performs slightly better than TGV, and 












OPTIMAL HIGHER-ORDER TOTAL VARIATION REGULARISATION 


27 



SSIM 

1 PSNR 

value 


mean 

std 

med 

best 

mean 

std 

med 

best 

mean 

std 

med 

best 

Noisy data 

0.978 

0.015 

0.981 

16 

41.56 

0.86 

41.95 

24 

2.9e* 

3.1e^ 

2.9e^ 

16 

Liy-TV 

0.987 

0.006 

0.988 

23 

42.43 

1.07 

42.37 

21 

2.5e^ 

3.4e^ 

2.5e^ 

20 

L^V-ICTV 

0.988 

0.006 

0.989 

119 

42.56 

1.06 

42.51 

135 

2Ae* 

3.5e® 

2.5e^ 

113 

L^V-TGV^ 

0.987 

0.006 

0.989 

42 

42.51 

1.09 

42.44 

20 

2Ae* 

3.6e® 

2.5e^ 

51 

95% t-test 

IGTV > TGV^ > TV 

1 IGTV > TGV^ > TV 

IGTV > TGV^ > TV 

L'i-TY 

0.986 

0.007 

0.987 

13 

42.46 

0.95 

42.43 

17 

0.42 

0.07 

0.43 

17 

Li-ICTV 

0.987 

0.007 

0.988 

139 

42.57 

0.95 

42.56 

128 

0.41 

0.07 

0.42 

128 

Li-TGV^ 

0.987 

0.007 

0.988 

38 

42.53 

0.97 

42.51 

40 

0.41 

0.07 

0.42 

40 

95% t-test 

IGTV > TGV^ > TV 

1 IGTV > TGV^ > TV 

IGTV > TGV^ > TV 


Table 6. Regularise! performance with batch learning, and 
costs, noise variance a‘^ = 2; BSDS300 dataset, resized. 



SSIM 

1 PSNR 

value 


mean 

std 

med 

best 

mean 

std 

med 

best 

mean 

std 

med 

best 

Noisy data 

0.731 

0.120 

0.744 

8 

27.72 

0.88 

28.09 

2 

1.4e‘' 

2.5e'’ 

1.4e'’ 

0 

Liy-TY 

0.893 

0.035 

0.897 

23 

31.24 

1.87 

30.94 

23 

7.5E-* 

2.2e^ 

7.3e^ 

18 

L^V-IGTV 

0.897 

0.034 

0.902 

134 

31.36 

1.81 

31.11 

150 

7.4e'‘ 

2.2e^ 

7.2e^ 

107 

LjjV-TGV^ 

0.896 

0.035 

0.901 

35 

31.31 

1.88 

31.01 

25 

7.4e‘‘ 

2.3e^ 

7.2e^ 

75 

95% t-test 

IGTV > TGV^ > TV 

1 IGTV > TGV^ > TV 

IGTV, TGV^ > TV 

L'i-TY 

0.887 

0.035 

0.889 

29 

31.31 

1.50 

31.15 

25 

5.72 

1.91 

5.51 

25 

Li-IGTV 

0.889 

0.036 

0.893 

127 

31.41 

1.44 

31.28 

131 

5.57 

1.83 

5.37 

131 

Li-TGV^ 

0.888 

0.035 

0.891 

44 

31.38 

1.50 

31.20 

44 

5.64 

1.90 

5.44 

44 

95% t-test 

IGTV > TGV^ > TV 

1 IGTV > TGV^ > TV 

IGTV > TGV^ > TV 


Table 7. Regularise! performance with batch learning, and 
costs, noise variance = 10; BSDS300 dataset, resized. 



SSIM 

1 PSNR 

value 


mean 

std 

med 

best 

mean 

std 

med 

best 

mean 

std 

med 

best 

Noisy data 

0.505 

0.143 

0.516 

4 

21.80 

0.92 

22.14 

1 

2.8e'' 

7.9e^ 

2.8e'’ 

0 

Liy-TY 

0.789 

0.067 

0.798 

18 

27.37 

2.13 

26.98 

24 

I.Oe"' 

3.7e^ 

9.8e'‘ 

14 

L^V-IGTV 

0.795 

0.065 

0.804 

139 

27.46 

2.10 

27.05 

141 

I.Oe® 

3.6e^ 

9.6e-‘ 

91 

Lly-TGY^ 

0.794 

0.066 

0.804 

39 

27.44 

2.12 

27.04 

34 

I.Oe® 

3.7e^ 

9.6e-‘ 

95 

95% t-test 

IGTV > TGV^ > TV 

1 IGTV > TGV^ > TV 

TGV^ > IGTV > TV 

L'i-TY 

0.786 

0.053 

0.790 

31 

27.50 

1.71 

27.27 

33 

14.11 

5.78 

13.16 

33 

Li-IGTV 

0.790 

0.054 

0.790 

123 

27.56 

1.64 

27.37 

119 

13.84 

5.54 

12.75 

119 

Li-TGV^ 

0.789 

0.053 

0.793 

46 

27.55 

1.70 

27.33 

48 

13.93 

5.73 

12.95 

48 

95% t-test 

IGTV, TGV^ > TV 

1 IGTV, TGV^ > TV 

IGTV > TGV^ > TV 


Table 8. Regularise! performance with batch learning, and 
costs, noise variance a‘^ = 20; BSDS300 dataset, resized. 


both perform better than TV, in average. For denoising of images with a high noise 
level ICTV and TGV score comparably well. For images with large smooth areas 
TGV performs better than ICTV. 

Moreover, we propose a new cost functional for the bilevel learning problem, which 
exhibits interesting theoretical properties and has a better behaviour with respect to 
the PSNR related cost used previously in the literature. This study raises the 
question of other, alternative cost functionals. For instance, one could be tempted 
to used the SSIM as cost, but its non-convexity might present several analytical and 
numerical difficulties. The new cost functional, proposed in this paper, turns out to 
be a good compromise between image quality measure and analytically tractable cost 
term. 
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(a) Original (b) Individual LlV-TGV^ (c) Batch LlV-TGV^ 

PSNR=42.06, SSIM=0.98 PSNR=41.82, SSIM=0.98 



(d) Noisy, cr = 2 (e) Individual LiV-ICTV, (f) Batch L^^-ICTY, 

PSNR=42.13, SSIM=0.99 PSNR=4I.93, SSIM=0.98 

Figure 15. Image for which ICTV performs better than TGV^, cj = 2 
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